Santander Customer Satisfaction

From frontline support teams to C-suites, customer satisfaction is a key measure of success. Unhappy customers don't stick around, and they rarely voice their dissatisfaction before leaving. Santander Bank is asking for help to identify dissatisfied customers early in their relationship, so that it can take proactive steps to improve a customer's happiness before it's too late. The project provides hundreds of anonymized features to predict whether a customer is satisfied or dissatisfied with their banking experience. All the project info can be found in this link. The metric used to evaluate the score is AUC (Area Under the Curve).

Configure matplotlib

Import Data

We will first drop the ID variable, as it contains no information useful for training the model.
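A minimal sketch of this step on a toy frame (column names follow the competition data; the toy values are made up):

```python
import pandas as pd

# Toy stand-in for the Santander training frame (the real data has 371 columns).
df = pd.DataFrame({"ID": [1, 2, 3], "var3": [2, 2, 2], "TARGET": [0, 0, 1]})

# Drop the identifier column: it carries no predictive signal.
df = df.drop(columns=["ID"])
```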

Data Exploration

There are 76020 rows (observations) and 371 variables.

All variables are numeric: 260 are integers and 111 are floats.

It is important to check whether any variable has NA values.

There are no NA values in the dataset.

Another important step is to check whether the data set contains any duplicated rows.

There are 4807 duplicated rows. We will need to drop them.
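The check and the drop can be sketched with pandas on a small example frame:

```python
import pandas as pd

# Small frame with one duplicated row to illustrate the check.
df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 3, 4]})

n_dups = df.duplicated().sum()            # number of duplicated rows
df = df.drop_duplicates().reset_index(drop=True)
```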

Let's check whether some of these variables might be categorical. First, we will calculate the number of unique values per column.

It can be seen that 34 variables have constant values and therefore add no relevant information to the data.
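A sketch of the constant-column filter, using `nunique` on a toy frame (column names are made up):

```python
import pandas as pd

df = pd.DataFrame({"const": [5, 5, 5], "var": [1, 2, 3]})

# Columns with a single unique value carry no information for the model.
constant_cols = [c for c in df.columns if df[c].nunique() == 1]
df = df.drop(columns=constant_cols)
```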

The number of variables is now 336.

Another way to drop variables is to check for a perfect negative or positive correlation between them. To do that, we compute the correlation matrix, select its upper triangle, and drop the columns where abs(corr) == 1.
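One way to implement that upper-triangle filter (a sketch on made-up data; `np.isclose` guards against floating-point rounding in the correlations):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0],
                   "y": [2.0, 4.0, 6.0, 8.0],   # perfectly correlated with x
                   "z": [1.0, 0.0, 3.0, 1.0]})

corr = df.corr().abs()
# Keep only the upper triangle (excluding the diagonal) so each pair is tested once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if np.isclose(upper[c], 1.0).any()]
df = df.drop(columns=to_drop)
```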

We were able to reduce the number of columns from 371 to 306, and the number of rows from 76020 to 71213.

Target exploration

We can see that the target variable is categorical and that the two classes are imbalanced: 0 corresponds to satisfied clients and 1 to dissatisfied clients. The bar plot improves the visualization, showing that only 3.95% of the clients are 'Not Satisfied'.
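The class shares can be computed with `value_counts` (a sketch on a toy target with a 4% minority share, roughly mirroring the real 3.95%):

```python
import pandas as pd

# Toy target column; in the real data 0 = satisfied, 1 = dissatisfied (~3.95%).
target = pd.Series([0] * 96 + [1] * 4)

counts = target.value_counts(normalize=True)
# counts[1] is the share of dissatisfied clients.
```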

Split data into train, validation and test
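The notebook's split code is not shown here; a typical two-stage split with scikit-learn, stratified so each part keeps the minority share, might look like this (the 60/20/20 proportions are an assumption for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)
y = np.array([0] * 90 + [1] * 10)

# First carve off the test set, then split the remainder into train/validation.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=42)
```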

Balance train data
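The notebook does not fix the balancing technique here; random undersampling of the majority class is one common choice and is sketched below purely for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
train = pd.DataFrame({"feat": rng.normal(size=100),
                      "TARGET": [0] * 90 + [1] * 10})

# Random undersampling: keep all minority rows, sample an equal number
# of majority rows, then shuffle the result.
minority = train[train["TARGET"] == 1]
majority = train[train["TARGET"] == 0].sample(n=len(minority), random_state=0)
balanced = pd.concat([majority, minority]).sample(frac=1, random_state=0)
```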

Dimensionality Reduction

There are several techniques for decomposing the attributes into a smaller subset. These can be useful for data exploration, visualization, or for building predictive or clustering models. Because our current dataframe has many variables, we will build a pipeline to reduce the dimensionality of the data.

The following PCA method will be used to reduce the dimensionality of the data.

PCA

PCA (Principal Component Analysis) is one of the main methods for reducing the dimensionality of data. It linearly combines the original columns into new components that maximize the explained variance. Each principal component is orthogonal to the others, and the components are ordered by how much variance they explain.

From the interactive plot it is possible to see that fewer than 30% of the variables explain 95% of the total variance. Let's fit the PCA again using a 95% explained-variance cut-off to choose the number of components.
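With scikit-learn, passing a fraction to `n_components` makes PCA keep the smallest number of components that reaches that cumulative variance. A sketch on synthetic data with built-in redundancy:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
X[:, 5:] = X[:, :5] + 0.01 * rng.normal(size=(200, 5))  # redundant columns

# Scale first, then ask PCA for the components explaining 95% of the variance.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95).fit(X_scaled)
X_reduced = pca.transform(X_scaled)
```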

Logistic Regression

It can be seen that the performance on the train data is substantially better than on the test set, a sign of overfitting. To improve the model we will tune its hyperparameters.
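Since AUC is the competition metric, the baseline fit and scoring would look roughly like this (sketched on a synthetic imbalanced dataset, not the real data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced stand-in for the Santander data.
X, y = make_classification(n_samples=500, weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# AUC needs predicted probabilities, not hard class labels.
train_auc = roc_auc_score(y_train, clf.predict_proba(X_train)[:, 1])
test_auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
```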

Logistic Regression - Model Tuning

We will use a specific, fixed validation set for the grid search.
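The notebook does not show the exact setup; one way to pin a fixed validation set inside `GridSearchCV` is scikit-learn's `PredefinedSplit` (sketched on synthetic data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, PredefinedSplit

X, y = make_classification(n_samples=300, random_state=0)

# test_fold: -1 marks rows always used for training, 0 marks the fixed
# validation set, so the grid search never re-folds the data.
test_fold = np.r_[np.full(200, -1), np.zeros(100, dtype=int)]
cv = PredefinedSplit(test_fold)

grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid={"C": [0.01, 0.1, 1.0]},
                    scoring="roc_auc", cv=cv).fit(X, y)
best_C = grid.best_params_["C"]
```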

Best parameters:

Best model training:

Naive Bayes

XGBoost

Because XGBoost doesn't require the features to be scaled and centered, we will run a first comparison with and without pre-processing.

Without Scale and PCA

With Scale and PCA

The model that uses all variables without pre-processing produced better results, so we will tune it in the next subchapter.

XGBoost Tuning

The best model has the following parameters:

Although the model is still overfitting, we were able to significantly improve the validation and test scores.

LightGBM

LightGBM tuning

Best LightGBM parameters:

Predict on test data

The final predictions were submitted to Kaggle and we obtained the following scores: